The Anatomy of MapReduce Jobs, Scheduling, and Performance Challenges
Authors
Abstract
Hadoop is a leading open-source tool supporting the realization of the Big Data revolution, based on Google's pioneering MapReduce work on the storage and processing of ultra-large amounts of data. Instead of relying on expensive proprietary hardware, Hadoop clusters typically consist of hundreds or thousands of multi-core commodity machines. Instead of moving data to the processing nodes, Hadoop moves the code to the machines where the data reside, which is inherently more scalable. Hadoop can store a diversity of data types, such as video files, structured or unstructured data, audio files, log files, and signal communication records. The capability to process large amounts of diverse data in a distributed and parallel fashion with built-in fault tolerance, using free software and cheap commodity hardware, makes a very compelling business case for the use of Hadoop as the Big Data platform of choice for most commercial and government organizations. However, making a MapReduce job that reads in and processes terabytes of data spanning tens or hundreds of machines complete in an acceptable amount of time can be challenging, as illustrated here. This paper first presents the Hadoop ecosystem in some detail and then describes the MapReduce engine at Hadoop's core. The paper then discusses the various MapReduce schedulers and their impact on performance.
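The programming model at Hadoop's core can be illustrated with a minimal, single-process sketch of the canonical word-count job. This is a hypothetical illustration of the map, shuffle, and reduce phases, not Hadoop's actual Java API; all function names here are illustrative:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values observed for a single key."""
    return key, sum(values)

def word_count(docs):
    # Each document stands in for one input split processed by a map task.
    intermediate = chain.from_iterable(map_phase(d) for d in docs)
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

print(word_count(["big data", "big cluster"]))
# {'big': 2, 'data': 1, 'cluster': 1}
```

In a real cluster the map tasks run on the machines holding the input blocks, the shuffle moves intermediate pairs over the network, and reduce tasks run in parallel per key partition; the scheduling of those tasks is what the rest of the paper examines.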
Related resources
Phurti: Application and Network-aware Flow Scheduling for Mapreduce
Traffic for a typical MapReduce job in a datacenter consists of multiple network flows. Traditionally, network resources have been allocated to optimize network-level metrics such as flow completion time or throughput. Some recent schemes propose using application-aware scheduling which can reduce the average job completion time. However, most of them treat the core network as a black box with ...
Real-Time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments
Supporting real-time jobs on MapReduce systems is particularly challenging due to the heterogeneity of the environment, the load imbalance caused by skewed data blocks, as well as real-time response demands imposed by the applications. In this paper we describe our approach for scheduling real-time, skewed MapReduce jobs in heterogeneous systems. Our approach comprises the following components:...
A Throughput Driven Task Scheduler for Batch Jobs in Shared MapReduce Environments
MapReduce is one of the most popular parallel data processing systems, and it has been widely used in many fields. As one of the most important techniques in MapReduce, task scheduling strategy is directly related to the system performance. However, in multi-user shared MapReduce environments, the existing task scheduling algorithms cannot provide high system throughput when processing batch jo...
Scheduling and Energy Efficiency Improvement Techniques for Hadoop Map-reduce: State of Art and Directions for Future Research
MapReduce has become ubiquitous for processing large data volume jobs. As the number and variety of jobs to be executed across heterogeneous clusters are increasing, so is the complexity of scheduling them efficiently to meet required objectives of performance. This report presents a survey of some of the MapReduce scheduling algorithms proposed for such complex scenarios. A taxonomy is provide...
Scheduling algorithm based on prefetching in MapReduce clusters
Due to cluster resource competition and task scheduling policy, some map tasks are assigned to nodes without input data, which causes significant data access delay. Data locality is becoming one of the most critical factors to affect performance of MapReduce clusters. As machines in MapReduce clusters have large memory capacities, which are often underutilized, in-memory prefetching input data ...
Column-Oriented Storage Techniques for MapReduce
Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes ...
Publication date: 2013